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Abstract: We study the nonparametric covariance estimation of a stationary Gaussian field X 
observed on a regular lattice. In the time series setting, some procedures like AIC are proved 
to achieve optimal model selection among autoregressive models. However, there exists no such 
equivalent results of adaptivity in a spatial setting. By considering collections of Gaussian Markov 
random fields (GMRF) as approximation sets for the distribution of X, we introduce a novel 
model selection procedure for spatial fields. For all neighborhoods m in a given collection M., 
this procedure first amounts to computing a covariance estimator of X within the GMRFs of 
neighborhood m. Then, it selects a neighborhood m by applying a penalization strategy. The so- 
defined method satisfies a nonasymptotic oracle type inequality. If X is a GMRF, the procedure is 
also minimax adaptive to the sparsity of its neighborhood. More generally, the procedure is adaptive 
to the rate of approximation of the true distribution by GMRFs with growing neighborhoods. 

Key-words: Gaussian field, Gaussian Markov random field, model selection, pseudolikelihood, 
oracle inequalities, Minimax rate of estimation. 
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Estimation adaptative de champs gaussiens stationnaires 



Resume : Nous etudions l'estimation non-parametrique d'un champ gaussien stationnaire X 
observe sur un reseau regulier. Dans le cadre des series temporelles, certaines procedures comme AIC 
realisent une selection de modele optimale parmi les modeles autoregressifs. Cependant, il n'existe 
aucun resultat analogue d'adaptation pour des champs spatiaux. En considerant des collections 
de champs de Markov gaussiens comme des ensembles d'approximation de la distribution de X, 
nous introduisons une nouvelle methode de selection de modele pour des champs spatiaux. Pour 
tout voisinage m dans une collection M. donnee, cette procedure estime la covariance de X par 
un champ de Markov de voisinage m. Puis, elle selectionne un voisinage m grace a une technique 
de penalisation. L'estimateur ainsi defini satisfait une inegalite oracle non-asymptotique. Si X est 
un champ de Markov gaussien, la procedure est minimax adaptative a la taille de son voisinage. 
Plus generalement, nous prouvons que la procedure s'adapte a la vitesse d'approximation de la 
distribution de X par des champs de Markov gaussiens de voisinage croissant. 

Mots-cles : Champ gaussien, champ de Markov gaussien, selection de modele, pseudo-vraisemblance, 
inegalites oracles, vitesse minimax d'estimation. 
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1 Introduction 

In this paper, we study the estimation of the distribution of a stationary Gaussian field X = 
(X[i,j])(i,j)eA indexed by the nodes of a square lattice A of size p x p. This problem is often en- 
countered in spatial statistics or in image analysis. 

Various estimation methods have been proposed to handle this question. Most of them fall into 
two categories. On the one hand, one may consider direct covariance estimation. A traditional 
approach amounts to computing an empirical variogram and then fitting a suitable parametric var- 
iogram model such as the exponential or Matern model (Cressie [Cre93| Ch.2). Some procedures 
also apply to non-regular lattices. However, a bad choice of the variogram model may lead to poor 
results. The issue of variogram model selection has not been co mpletel y solved yet, although some 
procedures based on cross-validation ha ve been proposed. See [Cre93| Sect. 2.6. 4 for a discussion. 
Most of the nonparametric (Hall et al. HFH94]) and semiparametric (Im et al. ISZ07| ) methods 



are based on the spectral representation of the field. To our knowledge, these procedures have 
not yet been shown to achieve adaptiveness, i.e. their rate of convergence does not adapt to the 
complexity of the correlation functions. 

An alternative approach to the problem amounts to considering the conditional distribution 
at one node given the remaining nodes. This point of view is closely connected to the notion of 
Gaussian Markov Random field (GMRF). Let Q be a graph whose vertex set is A. The field X is 
GMRF with respect to Q if it satisfies the following property: for any node £ A, conditionally 
to the set of variables X[k,l] such that (k,l) is a neighbor of in G, X[i,j] is independent 

from all the remaining variables. GMRFs are also sometimes called Gaussian graphical models. 
A huge literature develops around this subject since Gaussian graphical models are promising 
tools to analyze complex high-dimensional systems involved for instance in postgenomic data. In 
other applications, GMRFs are relevant b ecause they allow to per form M arkov chain Monte C arlo 
run fastly using Markov properties (e.g. [RT02l |). See Lauritzen [Lau96l | or Edwards EdwOOl ] for 



introductions to Gaussian graphical models and Markov properties. In the sequel, we assume that 
the node (0, 0) belongs to A. Since we assume that the field X is stationary, defining a graph Q is 
equivalent to defining the neighborhood m of the node (0, 0). Indeed, the neighborhood of any node 
€ A is the transposition of m by In the sequel, we call m the neighborhood of a GMRF. 

If the neighborhood is empty, then the Markov property states that the components of X are all 
independent. Alternatively, any zero- mean Gaussian stationary field is a GMRF with respect to 
the complete neighborhood (i.e. containing all the nodes except (0,0)). 

Numerous papers have been devoted to parametric estimation for stationary GMRFs with a 
known neighborhood. The authors have derived their asymptotic properties of such estimators (see 



BM75I . lBes77l . lGuv87l |). If the field X is assumed to be a GMRF with respect to a known neigh- 
borhood in all thes e works, the issue of n eighbor hood selection has be en less studied. Besag and 
Kooperberg |BK95| . Rue and Tjelmeland |RT02| |. Song et al. jSFGOSj . and Cressie and Verzelen 



[CV08l | have tackled the problem of approximating the distribution of a Gaussian field by a GMRF, 
but this requires the knowledge of the true distribution. Guyon and Yao have stated in |GY9fll | 
necessary conditions and sufficient conditions for a model selection procedure to choose asymptot- 
ically the true neighborhood of a GMRF with probability one. 
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In this paper, we study a nonparametric estimation procedure based on neighborhood selection. 
In short, we select a suitable neighborhood and estimate the distribution of X in the space of 
stationary GMRFs with respect to this neighborhood. The objective is not to estimate the "true" 
neighborhood. We rather want to select a neighborhood that allows to estimate well the distribution 
of X (i.e. to minimize a risk). In fact, we do not even assume that the true correlation of X 
corresponds to a GMRF. This estimation procedure is relevant for two main reasons: 



To our knowledge, it is the first nonparametric estimator in a spatial setting which achieves 
adaptive rates of convergence. 

In most of the statistical applications where GMRFs are involved, the neighborhood is a priori 
unknown. Our procedure allows to select a "good" neighborhood. 



Our problem on a two-dimensional field has a natural one-dimensional counterpart in time series 
analysis. It is indeed known that an auto- regressiv e process (AR) of order p is also a GMRF with 
2p nearest neighbors and reciprocally (see |Guy95l | Sect. 1.3). In this one-dimensional setting, our 
issue reformulates as follows: how can we select the order of an AR to estimate well the distribution 
of a time series? It is known that order selection by minim ization of criteria like AICC, A IC or FPE 
satisfy asymptotically o racle in equalities (Shibata [Shi80| and H urvich and Tsai HT8^|). We refer 



to Brockwell and Davis BD91 | and McQuarrie and Tsai [MT98l | for detailed discussions. However, 



one cannot readily extend these results to a spatial setting because of computational and theoretical 
difficulties. 

In the rest of this introduction, we further describe the framework and we summarize the main 
results of the paper. 



1.1 Conditional regression 

Let us now make precise the notations and present the ideas underlying our approach. In the sequel, 
A stands for the toroidal lattice of size p x p. We consider the random field X = {X[i,j])i<i,j< p 
indexed by the nodes of A. Besides, X v refers to the vectorialized version of X with the convention 
X[i,j] = X v [(i-i)xp+j] for any 1 < i,j < p. Using this new notation amounts to "forgetting" the 
spatial structure of X and allows to get into a more classical statistical framework. For the sake of 
simplicity, the components of X are defined modulo p in the remainder of the paper. 

Throughout this paper, we assume the field X is centered. In practice, the statistician has to 
first subtract some parametric form of the mean value. Hence, the vector X v follows a zero-mean 
Gaussian distribution J\f(0, E), where the p 2 x p 2 matrix S is non singular but unknown. Besides, 
we suppose that the field X is stationary on the torus A. More precisely, for any r > 0, any 
G {1, . . . ,p} 2 , and any (ki, l±), . . . , (k r , l r ) G {1, . . . ,p} 2r , it holds that 

(X[kl,h], . . . , X[k r ,t r ]) ~ . . . , X[k r +i,lr+j]) ■ 

We observe n > 1 i.i.d. replications of the vector X v . In the sequel, X v denotes the p 2 x n 
matrix of the n observations of X v . For any 1 < i < n, the p x p matrix X, stands for the i-th 
observation of the field X. All these notations are recalled in Table Q] in Section [L4l In practice, 
the number of observations n often equals one. Our goal is to estimate the matrix E. 

We sometimes assume that the field X is isotropic. Let G be the group of vector isometries of 
the unit square. For any node G A and any isometry g G G, g.(i,j) stands for the image of 
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in A under the action of g. We say that X is isotropic on A if for any r > 0, g <G G, and 
(h,h),...,(k r ,l r )€{l,...,p} 2r , 

{X[k u h], X[k r ,lr]) ~ (Alg.f/d,;!)], . . . , X[ff.(fc r ,i r )]) . 

As mentioned earlier, we aim at estimating the distribution of the field A throug h a conditional 



distribution approach. By standard Gaussian derivations (see for instance [Lau9a | App.C), there 
exists a unique p x p matrix 9 such that 9[o,o] = and 

X[o,o] = 0[i,j]X[i,j] + e[o,o] , (1) 

(*J)6A\{(0,0)} 

where the random variable e[o,o] follows a zero-mean normal distribution and is independent from the 
covariates (-X"[i,j])(i,j)eA\{(o,o)}- Equation ([J) describes the conditional distribution of X[o,o] given 
the remaining variables. Since the field X is stationary, the matrix 9 also satisfies 9[i,j] = 9[-i,-j] 
for any £ A. Let us note a 2 the conditional variance of X[o,o] and I p 2 the identity matrix of 
size p 2 . The matrix 9 is closely related to the covariance matrix £ of X v through the following 
property: 

S = ( r 2 (/ p2 -C(0))" 1 , (2) 

where the p 2 x p 2 matrix C{9) is defined as C(9)[ii(p-i)+j lt i 2 (j>-i)+h] := 9[i 2 -i-L,h-ji] for any 1 < 
ii,Z2,ji,j2 < P- The matrix (I p 2 —C{9)) is called the partial correlation matrix of the field X. The 
so-defi ned matrix C{9) is symmetric block ci rculant with p x p blocks as stated below. We refer to 
[RH05l l Sect. 2.6 or the book of Gray [Gra06j for definitions and main properties on circulant and 
block circulant matrices. 

Lemma 1.1. Let 9 be a square matrix of size p such that 

for any 1 < i,j < p, 9[ij] = 9[-i,-j], (3) 

then the matrix C{9) is symmetric block circulant with px p blocks. Conversely, if B is a p 2 x p 2 
symmetric block circulant matrix with p x p blocks, then there exists a square matrix 9 of size p 
satisfying and such that B = C(9). 



A proof is given in the technical appendix [Ver09b|. In conclusion, estimating the matrix E/ct 2 
amounts to estimating the matrix C(6>), which is also equivalent to estimating the px p matrix 9. 
This is why, we shall focus on the estimation of the matrix 9. 



Let us precise the set of possible values for 9. In the sequel, 9 denote the vector space of the pxp 
matrices that satisfy 0[o,o] = and 9[i,j] = 9[-i,-j], for any (i, j) G A. A matrix 9 G 6 corresponds 
to the distribution of a stationary Gaussian field if and only if the p 2 x p 2 matrix (I p 2 — C(9)) is 
positive definite. This is why we define the convex subset + of by 

9+ := {9 eO s.t. (I P 2 - C{9)) is positive definite} . (4) 

The set of covariance matrices of stationary Gaussian fields on A with unit conditional variance is 
therefore in one to one correspondence with the set 8 + . Let us define the corresponding set 8 1S0 
and 0+' 1SO for isotropic Gaussian fields. 

9 iso -={9eQ , 9[i,j] = 9[g.(i,j)] , V(i,j) e A, Vg G G} and 9+- iso := 9+ n 6 iso . (5) 
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1.2 Model selection 

We have the issue of covariance estimation as an estimation problem for conditional regressions 
(Equation {TJ). However, the set 9+ of admissible parameters for the estimation is huge. The 
dimension of 6 is indeed of the same order as p 2 whereas we only observe p 2 non-independent data 
if n equals one. In order to avoid the curse of dimensionality, it is natural to assume that the target 
9 is approximately sparse. 

It is indeed likely that the coefficients 9[i,j] are close to zero for the nodes which are far 
from the origin (0,0). By Equation ([J), this means that X[o,o] is well predicted by the covariates 
X[i,j] whose corresponding nodes are close to the origin. In other terms, the true covariance 
is presumably well approximated by a GMRF with a reasonable neighborhood. The main difficulty 
is that we do not know a priori what "reasonable" means. We want to adapt to the sparsity of the 
matrix 9. 



In the sequel, to refers to a subset of A \ {0,0}. We call it a model. By Equation (JTJ) , the 
property U X is a GMRF with respect to the neighborhood to" is equivalent to "the support of is 
included in m". We are given a nested collection M. of models. For any of these models to £ A4, we 
compute 9„ l:P1 the Conditional least squares estimator (CLS) of 9 for the model m by maximizing 
the pseudolikelihood over a subset of matrices 9 whose support is included in m. These estimators 
as well as their dependency on the quantity p\ are defined in Section [51 

The model to that minimizes the risk of 9 m>pi over the collection M. is called an oracle and is 
noted m*. In practice, this model is unknown and we have to estimate it. The art of model selection 
is to pick a model m € M that^ is large enough to enable a good approximation of 9 but is small 
enough so that the variance of 9 mtPl is small. Let us reformulate the approach in terms of GMRFs: 
given a collection M. of neighborhoods, we compute an estimator of 9 in the set of GMRFs with 
neighborhood to, for any m € A4. Our purpose is to select a suitable neighborhood to so that the 
estimator 9^ has a risk as small as possible. 

A classical method to estimate a good model rh is achieved through penalization with respect 
to the size of the models. In the following expression, J n ,p(-) stands for the CLS empirical contrast 
that we shall define in Section [2j We select a model fa by minimizing the criterion 



to = arg mm 



Jn,p(9m, Pl ) +pen(m) . (6) 



where pen(.) denotes a positive function defined on M. In this paper, we prove that under a suitable 
choice of the penalty function pen(.), the risk of the estimator 9^ is as small as possible. 



1.3 Risk bounds and adaptation 

We shall assess our procedure using two different loss functions. First, we introduce the loss 
function l(., .) that measures how well we estimate the conditional distribution ([TJ) of the field. For 
any 9i, 9 2 € 6, the distance l(9i, 2 ) is defined by 

l(9i,9 2 ) := ^tr[{C{9 1 )-C{9 2 ))H{C{9 1 )-C{9 2 ))] . (7) 



INRIA 



Estimation of stationary Gaussian fields 



7 



Let us reformulate l{6\, fe) in terms of conditional expectation 

l e |[E 9l (X[o,o]|X A \ {0 ,o}) -Ee 2 (X[o,o]|X A \ {0 ,o})] 2 } , 



= E fl 



where Eg(.) stands for the expectation with respect to the distribution of X v , Af(0, a 2 (I p 2 —C(0)) 1 ). 
Hence, 1(0,9) corresponds the mean squared predi ction l oss which is often used in the random 
design regression framework, in time series analysis |HT89j . or in spatial statistics |SFG08l |. More- 
over, t he loss function 1(9 , 9) is also connected to the notion of kriging error. The kriging predictor 
(Stein |Ste99| ) of X [0,0] is defined as the best linear combination of the covariates (^[M])(fc.i)6A\{(o.o} 
for predicting the value JT[o,o]. By Equation ([1]), this predictor is exactly Yl(k z)eA\{(o 0} 9[k,i]X[k,i] 
and the mean squared prediction error is a 2 . If we do not know 9 but we are given an estimator 9, 
then the corresponding kriging predictor J2(k i)eA\{(o 0} 9[k,i]X[k,i] has a mean squared prediction 

error equal to a 2 +1(9, 9). Kriging is a key concept in spatial statistics and it is therefore interesting 
to consider a loss function that measures the kriging performances when one estimates 0. 

We shall also assess our results using the Frobenius distance noted \\-\\f and defined by \\A\\ 2 F := 
Si<i j< P A[i,j} 2 . Observe that the Frobenius distance \\9\ — fell! also equals the Frobenius distance 
between the partial correlation matrices (I p 2 — C(9i)) and (I p 2 — C(fe)) (up to a factor p 2 ) 



life - fell! = -o||(V - 0(9,)) - (J p2 - C(fe))|| 



2 

F I 



(8) 



Our aim is then to define a suitable penalty function pen(.) in so that the estimator 
performs almost as well as the oracle estimator 6* m * ;Pl . For any model m 6 M, we define 9„ ltPl 
the matrix which minimizes the loss 1(9' , 9) over the sets of matrices & corresponding to model 



m,pi 

as 



, 9) is called the bias. Our main result is stated in Section [3j We provide a condition 



The loss 1(9,, 

on the penalty function pen(.), so that the selected estimator satisfies a risk bound of the form 

Card(m) 









E e 


I (Ofh, Pl , Oj 


< L inf 







l(6m, Pl ,9) + <Anax(£)- 



(9) 



where (p max (S) is the largest eigenvalue of £ and Card(.) stands for the cardinality. Contrary to 
most results in a spatial setting, this upper bound on the risk is nonasymptotic and holds in a 
general setting. The term <p max (£)Card(m)/(np 2 ) grows linearly with the size of m and goes to 
with n and p. In Section [H we prove that the variance term of a model m is of the same order 
as ip max (T,)Card(m)/(np 2 ). Hence, the bound (J9j) tells us that the risk of 9ff l>pi is smaller than 
a quantity which is the same order as the risk Ee[l(9 m *, Pl , 9)] of the oracle m* . We say that the 
selected estimator achieves an oracle-type inequality. 



In SectionJU we bound the asymptotic expectations E[7(6>. 



and connect them to the vari- 
ance terms in Bound ([9]). As a consequence, we prove that under mild assumptions on the target 9, 
the upper bound © is optimal from the asymptotic point of view (up to a multiplicative numerical 
constant). We discuss the assumptions in Section In Section [6l we compute nonasymptotic mini- 
max lower bounds with respect to the loss functions l(., .) and ||.||fn. We then derive that under mild 
assumptions, our estimator 9ff l . pi is minimax adaptive to the sparsity of 9 and minimax adaptive 



m,pi j 1 
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to the decay of 9. 



To our knowledge, these are the first oracle-type inequalities in a spatial setting. The computa- 
tion of the minimax rates of convergence is also new. Moreover, most of our results are nonasymp- 
totic. Although we have considered a square on the two-dimensional lattice, our method straight- 
forwardly extends to any d-dimensional toroidal rectangle with d > 1. In t he one -dimensional 
setting, we retrieve a oracle- type inequality that is close to the work of Shibata [Shi80| . Yet, he has 
stated an asymptotic oracle inequality for the estimation of autoregressive processes. In contrast, 
our result applies on a torus and is only optimal up to constants but it is nonasympotic and most of 
all applies for higher dimensional lattices. In Section [TJ we further discuss the advantages and the 
weak points of ou r metho d. Moreover, we mention the extensions and the simulations made in a 
subsequent paper Ver09a |. All the proofs are postponed to Section [8] and to the appendix Ver09b| . 



1.4 Some notations 

Throughout this paper, L, L\, L2, ■ ■ ■ denote constants that may vary from line to line. The nota- 
tion L(.) specifies the dependency on some quantities. For any matrix A, (p ma ^(A) and ip m i n (A) 
respectively refer the largest eigenvalue and the smallest eigenvalues of A. We recall that \\A\\p is 
the Frobenius norm of A. For any matrix 8 of size p, \\9\\i stands for the sum of of the absolute 
values of the components of 9, we call it its l\ norm. In the sequel, P is the square matrix of size p 
whose indices are 0. Given p > 0, the ball Bi(0 p ;p) is defined as the set of square matrices of size 
p whose h norm is smaller than p. Finally, Table Q] gathers the notations involving X. 



X 


Matrix of size p x p 


Random field 


X v 


Vector of length p l 


Vectorialized version of X 


x v 


Matrix of size p 2 x n 


Observations of X v 




Matrix of size p x p 


i-th observation of the field X 



Table 1: Notations for the random field and the data. 



2 Model selection procedure 

In this section, we formally define our model selection procedure. 
2.1 Collection of models 

For any node (i, j) belonging to the lattice A, let us define the toroidal norm by 

\(i,j)\ 2 t := [iA(p- i)f + [j A {p - j)f 

We aim at selecting a "good" neighborhood for the GMRF. Since X corresponds to some "spatial" 
process, it is natural to assume that nodes that are close to (0,0) are more likely to be significant. 
This is why we restrict ourselves in the sequel to the collection Mi of neighborhoods. 

Definition 2.1. A subset m C A\{(0, 0)} belongs to M.\ if there exists a number r m > 1 such that 
m = {(i,i)GA\{(0,0)} s.t. \(i,j)\ t <r m } . (10) 
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Figure 1: Examples of models. The four gray nodes refer to mi. The model rri2 also contains the 
nodes with a cross whereas ms contains all the nodes except (0,0). 



The collection Mi is totally ordered with respect to the inclusion and we therefore order our 
models uiq C mi C.Cmj.... For instance, mo corresponds to the empty neighborhood whereas 
mi stands for the neighborhood of size 4. See Figure Q] for other examples. 

For any model ra £ Mi, we define the vector space m as the subset of the elements of 8 whose 
support is included in m. We recall that is defined in Section ITTTl Similarly 0^° is the subset 
of 1SO whose support is included in m. The dimensions of TO and 0^° are respectively noted d m 
and d™. Since we aim at estimating the positive matrix (I p 2 — C(6)), we shall consider the convex 
subsets of 0+ and 0„ lso that correspond to non-negative precision matrices. 

0+ := m n 0+ and 0+' iso := 6™ fl + < iso . (11) 

For instance, the set 0+ is in one to one correspondence with the sets of GMRFs whose neighbor- 
hood is made of the four nearest neighbors. Similarly, 0+ t is in one to one correspondence with the 
GMRFs with eight nearest neighbors. In our estimation procedure, we shall restrict ourselves to 
precision matrices whose largest eigenvalue is upper bounded by a constant. This is why we define 
the subsets 9^ 2 pi and 9^'™ for any pi >2. 

:= {6eQ+,<p m ^(l p2 -C(9))< Pl } (12) 
:= {0 e 0m' is ° , ¥W (V - G{6)) < pi} . (13) 



Finally, we need a generating family of the spaces m and© 1 ^ . For any node € A\{(0,0)}, 
let us define the p x p matrix as 

^ (1 if {k,l) = {i,j)ox(k,l) = -(i,j) 

J 1 J \ otherwise . v ' 

Hence, m is generated by the matrices ^ij for which belongs to m. Similarly, for any 

(i, j) e A \ {(0,0)}, let us define the matrix *j s ° by 

^ iso f 1 if 3g€G,{k,l)=g.(i,j) ( , 

' 1 -~ 1 otherwise . li0J 
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2.2 Estimation by Conditional Least Squares (CLS) 

Let us turn to the conditional least squares estimator. For any 9' G 6 + , the criterion J n , p (9') is 
defined by 

1 n ( \ 2 

ln,p{ 9 ') ■= — 2 ( — Y O'ViMXitii+iuh+h] J ■ (16) 



np »=i i<j'ij2<P v (i!,i 2 )eA\{(o,o)} 

In a nutshell, 7 n , p (#') is a least squares criterion that allows to perform the simultaneous linear 
regression of all X* [31,32] with respect to the covariates (Xi[Zi,j 2 ])(i 1) j 2 ) 7 4fj 1) j 2 ). The advantage of this 
criterion is that it does not require the computation of a determinant of a huge matrix as for the 
likelihood. We shall often use an alternative expression of 7 n ,p(#') in terms of the factor C{9') and 
the empirical covariance matrix X V X V * : 

ln , p {9') = ^tr[(I p ,-C(9'))^y^(I p ,-C{9'))} . (17) 

One proves the equivalence between these two expressions by coming back to the definition of C{6'). 
Let pi > 2 be fixed. For any model m G M, we compute the CLS estimators m ,pi an d C°pi by 
minimizing the criterion 7n,p(-) as follows 

9 m ,pi ■= ar g min J n , P ( ') and C°pi := ar § mil1 7n. P (^) , (18) 

where A stands for the closure of the set A. The existence and the uniqueness of 9 m>pi and 9™° pi 
are ensured by the following lemma. 

Lemma 2.2. For any 9 G 6 + , 7n. P (-) *s almost surely strictly convex on 0+. 

The proof is postponed to the appendix Ver09 bj . We discuss the dependency of 9 m%Pl on the 
parameter p\ in Sectional For stationary Gaussian fields, minimizing the CLS criterion 7«,p(-) over 
a set ©m,pi i s equivalent to minimizing the product of the conditional likelihoods (X[i,j]\X_^ t jy), 
called Conditional Pseudo- Likelihood (CPL): 

p£ n (9',XV) := [] C n , e < (X ib i, j2 ]|(X,)_ {j , j2} ) = (V2wf V exp , 

1 < i < n, 
(h,h) G A 

where we recall that tr 2 ref ers to t he conditional variance of any X[i,j]. In fact, CLS estimators were 
first introduced by Besag [Bes75l | who call them pseudolikelihood estimators since they minimize 
the CPL. 



Let us define the function 7(.) as an infinite sampled version of the CLS criterion J n ,p(-)' 



7 (0') :=E e [ 7n . p (0')] =E 



X[o,o] - Y 0'[i,j]X[ij]j 



(j,i)#(o,o) 



(19) 
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for any 9' , 9 € + . The function 7(6*') measures the prediction error of X[o,o] if one uses Ylu j)^(o 0) ^'Wl-X'W] 

as a predictor. Moreove r, it is a special case of the CMLS criterion introduced by Cressie and Verze- 

len in (Eq.10) of [CVPSt ] to approximate a Gaussian field by a GMRF. Hence, one may interpret the 

CLS criterion as a finite sampled version of their approximation method. Observe that the function 

7(.) is minimized over + at the point 9 and that 7(6*) = Varg(X[o,o] |X_{ ,o}) = ° 2 ■ Moreover, 

the difference 7(6*') - 7(0) equals the loss 1(6', 6) defined by (7|). 

For any model me M, we introduce the projections 9 m . pi and 9^ as the best approximation 
of 9 in Qti, P1 and 6m,p°- 

m , Pl :=arg min 1(9' ,9) and 9™ pi := arg min l(6>',6»). (20) 

0'ee+, P1 fee+'/p" 

Since 7(.) is strictly convex on + , the matrices 9 m>Pl and 9f° pi are uniquely defined. By its 
definition ([7]), one may interpret l(., .) as an inner product on the space O; therefore, the orthogonal 

projection of 9 onto the convex closed set 6m, w (resp. 0m,™) with respect to l(., .) is 6* m pi (resp. 
^m°pi)- It then follows from a property of orthogonal projections that the loss of 9 mpi is upper 
bounded by 

) ■ (21) 

The first term l(9 m ^ pi , 9) accounts for the bias, whereas the second term l(9, n ^ pi , 9„ l:P1 ) is a variance 
term. Observe that 9 € 6+ does not necessarily imply that the bias l(9 m:P1 ,9) is null because in 

general 8m 7^ Qm, Pl ■ This will be the case only if 9 satisfies the following hypothesis. 

(Hi): fftn«(/ P »-C(ff))<p 1 . (22) 

Assumption (Hi) is necessary to ensure the existence of a model m £ M. such that the bias is 
zero (i.e. 6* m pi = 9). By identity |(2]), one observes that (Hi) is equivalent to a lower bound on the 
smallest eigenvalue of E, i.e. (p m j n (E) < <r 2 /pi- We further discuss (Hi) in Sectional 



For the sake of completeness, we recall the penalization criterion introduced in ([6]). Given a 
subcollection of models M C Mi and a positive function pen : M. — > R + that we call a penalty, 
we select a model as follows 



m := arg mm 



In 



9,. 



n,p \ w vn,p\ 



pen(m) and m 11 



arg mm 



ln,p &■ 



11SO 

TO, Pi 



pen(m) 



Observe that in and m lso depend on pi . For the sake clarity, we do not emphasize this dependency 
in the notation. In the sequel, we write 9 Pl and 9 p s ° for 9ff l:Pl and 9™£ 1 . 



3 Main Result 

We now provide a nonasymptotic upper bound for the risk of the estimators 9 P1 and 9 p ° . Let us 
recall that E stands for the covariance matrix of X v . 
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Theorem 3.1. Let K be a positive number larger than a universal constant Kq and let M. be a 
subcollection of Mi. If for every model m £ M . 



pen(m) > K p\tp max (Y,)-^ 



(23) 



then for any 9 £ 8 + , the estimator 9 pi satisfies 



<Lx{K) inf [l{6 miPl ,6)+pen(m)} 

meM 



L 2 {K) ^ 

np 



(24) 



A similar bound holds if one replaces 9 P1 by 9' p ° , 8 + by 0+> ISO ; 9 mpi by 9f°, and d m by dj; 



The proof is postponed to Section l8l2l It is based on a novel concentration inequality for suprema 
of Gaussian chaos stated in Section 18.11 The constant Kq is made explicit in the proof. Observe 
that the theorem holds for any n, any p and that we have not performed any assumption on the 
target 9 £ 9 + (resp. 6 + ' 1S0 ). If the collection M does not contain the empty model, one gets the 
more readable upper bound 



'pi i 1 



<L(K) inf [Z(0 m , pi ,0)+pen(m)] . 

meM 



This theorem tells us that 9 Pl essentially performs as well as the best trade-off between the bias term 



l(0 m , Pl ,0) and p\^ 



c (£)^p- that plays the role of a variance. Here are some additional comments. 



Remark 1. Consider the special case where the target 9 belongs to some parametric set 0+ 
with m £ M.. Suppose that the hypothesis (Hi) defined in ((22j) is fulfilled. Choosing a penalty 
pen(m) = Kpjip nmx {T,)^, we get 



< L{K)p\^ 



m— 2 

np 



(25) 



We shall prove in Section 14.21 and 16.11 that this rate is optimal both from an asymptotic oracle and 
a minimax point of view. We have mentioned in Section [231 that (Hi) is necessary for the bound 
(J25| to hold. If p\ is chosen large enough, then Assumption (Hi) is fulfilled. We do not have 
access to this minimal p\ that ensures (Hi), since it requires the knowledge of 9. Nevertheless, we 
argue in Section[5]that "moderate" values for p\ ensure Assumption (Hi) when the model m is small. 



Remark 2. We have mentioned in the introduction that our objective was to obtain oracle in- 
equalities of the form 



En 



'Pi i ' 



< L(K) inf E 

meM 



'm,pi i ' 



L{K)E 



m*,pi j 1 



This is why we want to compare the sum l(9 m ^ Pl ,9) +pen(m) with E[l(9 miPl , 9)}. First, we provide 
in Section |4~T1 a sufficient condition so that the risk E[l(9 m , Pl , 9)] decomposes exactly as the sum 
l{@m,pnQ) + E[l(9 nitPl , 6*„i iPl )]. Moreover, we compute in Section [4~2l the asymptotic variance term 
E[l(9 m , Pl , 9 m , Pl )] and compare it with the penalty term pfip 



(E)^2-. We shall then derive oracle 
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type inequalities and discuss the dependency of the different bounds on <p max (E). 

Remark 3. Condition (|23|) gives a lower bound on the penalty function pen(.) so that the re- 
sult holds. Choosing a proper penalty term according to l|23"|) therefore requires an upper bound on 
the largest eigenvalue of E. However, such a bound is seldom known in practice. We shall mention 
in Section [7] a practical method to calibrate the penalty. 

A bound similar to (|24| holds for the Frobenius distance between the partial correlation matrices 
(I P 2-C(0)) and (I P 2 -C(6 P1 )). 

Corollary 3.2. Assume the same as in Theorem \3.1l except that there is equality in l[23\) . Then, 

Kp\d n 



\\C(6 P1 )-C(9) 



L 2 {K) 



^min(E) m&M 

(s) Pi 



\C{6 m ^)-C{6)\ 



(/3min(E) Tl 



(26) 



A similar result holds for isotropic GMRFs. 



Proof of Corollary 1 3. £1 This is a consequence of Theorem l3.ll By definition J7]) of the loss function 
■), the two following bounds hold 

P 2 1(01,O 2 ) > VntofflWCtfx) - C(0 2 )\\ 2 F 



p 2 l(6 1 ,9 2 )<<p max (Z)\\C(6 1 )-C(6 2 )\\ 



Gathering these bounds with |24|) yields the result. 



□ 



The same comments as for Theorem l|3.ip hold. We may express this Corollary 13.21 in terms of 

Kpfd n 



the risk E(\\e pi - 9\\ 2 F ), since ||C(0i) - C(6 2 )\\ 2 F = p 2 \\0! - 9 2 \\ 2 F : 



En 



'Pi 



< L± [K] 
+ L 2 (K) 



1 inf 



ywt(E) p\ 

Vmin (E) np 2 



'm,pi 



II 



np z 



4 Parametric risk and asymptotic oracle inequalities 

In this section, we study the risk of the parametric estimators 6 m>Pl in order to assess the optimality 
of Theorem O 



4.1 Bias- variance decomposition 

The properties of the parametric estimator 8 m . Pl and of the projection Q m , Pl differ slightly whether 
#m,pi belongs to the open set 0+ or to its border. Observe that Hypothesis (Hi) defined in (j22|) 
does not necessarily imply that the projection 8 mtPl belongs to 0+. This is why we introduce the 
condition (H2): 



0€Bi(O p ,l) 



< 1 . 



(27) 
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The condition ||0||i < 1 is equivalent to (I p 2 — C{9)) is strictly diagonally dominant. Condition 
(H2) implies that the largest eigenvalue of (J p 2 — C(6)) is smaller than 2 and therefore that (Hi) is 
fulfilled since p\ is supposed larger than 2. We further discuss this assumption in Section [H 

Lemma 4.1. Let 9 £ + such that (H2) holds and let m £ Mi. Then, the minimum of"f(.) over 
m is achieved in 0^.2- This implies that 



9„ hPl = arg min 7(6*') and j(0 m , Pl ) = Var g {X[o,o]\X m ) . 

Besides, ||0 m , pi ||i < \\6\\i- The same results holds for 9^° if 9 in 0+' lso . 

The proof is given in the technical appendix Ver09bl | . The purpose of this property is threefold. 



First, we derive that Assumption (H2) ensures that 9 m , Pl belongs 8^ pi and that the smallest eigen- 
value of (I p 2 — C(9 m , Pl )) is larger than 1 — \\0\\i. Second, it allows to express the projection 9 m , Pl in 
terms of conditional expectation (Corollary I4.2|) . Finally, we deduce a bias-variance decomposition 
of the estimator 6> TOiPl (Corollary I4.3p . In other words, the equality holds in (|2"T]l . 



Corollary 4.2. Let 9 £ + such that (H2) holds and let m £ A^i. The projection 9 mjPl is uniquely 
defined by the equation 

Eg (X[0,0]\X m ) = ^ S m,pilh3]Xli,j] , 
(i,j)em 

and 9 mtPl [i,j] — for any 4- m - Similarly, if 9 £ 0+> 1S0 satisfies (H2), then 9^ is uniquely 
defined by the equation 

Eg (X[o,o]\X m ) = C Pl M*M > 

(i,j)em 



a 9™ pi [i.j] = for any ^ m. 



Consequently, Y^i<i j< p @m,pi [hi]X[i,j] is the best linear predictor of X[o,o] given the c ovariat es 
X[i,j] with (i,j) £ m. This is precisely the definition of the kriging parameters (Stein Ste99l |). 



Hence, the matrix 6* m pi corresponds to the kriging parameters of X [0,0] with kriging neighborhood's 
range of r m . The distance r m is introduced in Definition 12.11 and stands for the radius of m. 



Corollary 4.3. Let 9 £ + such that (H2) holds and let m £ M.i. The loss of 6 m , pi decomposes 
as l(9 m ,pn9) = l(9 m ,p 1 ,9) + l(9 m ,p 1 ,9 miP1 ). If 9 belongs to 6^ 1S0 and (H2) holds, then we also have 
the decomposition 1(9™ pi , 9) = l(9™ pi , 9) + 1(9™ pi , 9 m , Pl ). 



A proof is provided in the technical appendix |Ver09bl |. If 9 does not satisfy Assumption (H2), 
then 6 m ,p! does not necessarily belong to 0„ iPl and there may not be such a bias variance decom- 
position. 

4.2 Asymptotic risk 

In this section, we evaluate the risk of each estimator 6 m , Pl and use it as a benchmark to assess the 
result of Theorem 13. II We have mentioned in Corollary 14.31 that under (H2) the risk Eg\},(9 m ,p r ,9)\ 
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decomposes into the sum of the bias l(9 mtPl ,6) and a variance term E,e[l(6 mtPl , mtPl )]. If this 
last quantity is of the same order as the penalty pen(m) introduced in ([23f . then Theorem 13,11 
yields an oracle inequality. However, we are unable to express this variance term Eg[l(9 m:P1 , 9 m ^ pi )] 
in a simple form. This is why we restrict ourselves to study the risks when n tends to infinity. 
Nevertheless, these results give us some hints to appreciate the strength and the weaknesses of 
Theorem 13. II and the upper bound ([25]) . 



In the following proposition, we adapt a result of Guyon [Guv95l | Sect. 4. 3. 2 to obtain an asymp- 
totic expression of the risk Eg[l(9 m>Pl , 9 m , Pl ))- We first need to introduce some new notations. For 
any model m in the collection M.\ \ {0}, we fix a sequence (ik>jk)k=i...d m of integers such that 
, . . . , *i dm ,j dm ) is a basis of the space m . Then, Xmlo.o] stands for the random vector of size 
d m that contains the neighbors of X[o,o] 



,tr{* idm , jdm X*)] 
Besides, for any 9 S + , we define the matrices V, W and IL m as 



V 
W[k,i] 

IL rn 



dr 



for any k 



cove (Xm [o,o]) 

C (* iw -J (V - C{9 m , Pl )f (7 p2 - C{9)Y 2 C i>l',.., 
Diag (11*^111, k = l,...,d m ) 

where for any vector u, Diag(u) is the diagonal matrix whose diagonal elements are the components 
of u. We also define the corresponding quantities Xm°l '°]> ^ 1S °> W 1S0 , and IL\ S ° in order to consider 
the isotropic estimator 0™ . 

Proposition 4.4. Let m be a model in M.\ \ {0} and let 9 be an element of 0+ that satisfies (Hi). 
Then, 6 m , Pl converges to 9 in probability and 



lim np Eg 

n — >+oo 



(flm lPl ,0)] =2aHr[lL m V- 1 ] 



(28) 



Let 9 in + such that (H2) is fulfilled. Then, Q m , pi converges to 9 mjPl in proba 



anc 



lim np Eg 

n — >+oo 



2aHr(WV~ 1 ) . (29) 
and if one replaces V , W, and 



Both results still hold for the estimator 9™ pi if 9 belongs to 0^ 
LL m by V iso , W iso , and IL™. 

In the first case, Assumption (Hi) ensures that 9 £ 8^,., whereas Assumption (H2) ensures 
that # m , Pl S ©mp!- The proof is based on the extension of Guyon's approach in the toroidal 
framework. 

The expressions |28|) and l|29|) are not easily interpretable in the present form. This is why 
we first derive ([28]) when 9 is zero. Observe that it is equivalent to the independence of the 

(■^[*>j])(t,j)eA' 

Example 4.5. Assume that 9 is zero. Then, for any model m £ A4i, the asymptotic risks of 9 miP1 
and 9™ pi satisfy 



lim np Eo 

n — >+oo 



^m,pi 7 Op 



= 2a 2 d m and lim np 2 



n — >+oo 

where we recall that d™ is the dimension of the space 0J^° . 



» °p) 



2o*& 



2 jiso 
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Proof. Since the components of X are independent, the matrix V equals a 2 IL m . We conclude by 

□ 



applying Proposition 14,41 

Therefore, when the variables X[i,j] are independent, the asymptotic risk of 9 mtPl equals, up to 
a factor 2, the variance term of the least squares estimator in the fixed design Gaussian regression 
framework. This quantity is of the same order as the penalty introduced in Section [3j When the 
matrix 9 is non zero, we can lower bound the limits (|28f and fl29j) - 



Corollary 4.6. Let m be a model in M.\ and let 9 € 0+ that satisfies (Hi). Then, the variance 
term is asymptotically lower bounded as follows 



lim np Eg 

n — >+oo 



'm,pi j 1 



> La 2 ip min [l p 2 - C(9)\ d rn = La A - 



where L is a universal constant. Let 9 € + that satisfies (H2). For any model m € M.\, 



(30) 



lim np K@ 

n — !-+oo 



l(9 m . Pl ,9 m , Pl )] >£<7 2 (l-j|#||i) 



I) d r , 



(31) 



The proof is postponed to the technical appendix |Ver09bl |. Again, analogous lower bounds 
hold for 9 1 *° pi when 9 belongs to 1SO >+. This corollary states that asymptotically with respect 

to n the variance term of 9 m>Pl is larger than the order d m /{np 2 ). This expression is not really 
surprising since d m stands for the dimension of the model m and np 2 corresponds to the number of 



data observed. Let define Rs,oo(9 m ,p 1 , 9 m , Pl ) 

y m,pi 



lim n ^ +00 np 2 Ee[l(9 miPl , m , Pl )] as the asymptotic 



variance term for 9 m . n , rescaled by the number np 2 of observations. 



The first part of the corollary |30|) states that from an asymptotic point of view the upper bound 
25| is optimal. By Theorem l3.lt if we choose pen(m) = K p 2 tp max (T,)^fc , then it holds that 



'Pl ! 1 



< 



L(K, Pl ,<p min [/ p2 -C(0)]) 



Re 



00 V m,pi 1 

np 2 



for any model m € M. \ and any 9 £ 0+ that satisfies (Hi). This property holds for any n and 
any p. Hence, 9 P1 performs as well as the parametric estimator 9 m , Pl if the support of 9 belongs to 
some unknown model m and if 9 satisfies (Hi). 



If we assume that ||#||i < 1 (Hypothesis (H 2 )), we are able to derive a stronger result. 



Proposition 4.7. 



K > Kq, pi > 2, i] < 1 and a collection M. C M.\ \ 0, we define 



the estimator 9 P1 with the penalty pen(m) = Kp\ 
by 



np 2 (1— r}) 



. Then, the risk of 9 P1 is upper bounded 



<L(K, Pl ,r]) inf <l(0 m , Pl ,6) 



R$, 00 ( ^m,pi s 9 m pi 



Tip* 



(32) 



for any 9 E 6+ n B x (0 p ,n). 



INRIA 



Estimation of stationary Gaussian fields 



17 



Observe that this property holds for any n and any p. If the matrix 9 is strictly diagonally dom- 
inant, we therefore obtain an upper bound similar to an oracle inequality, except that the variance 
term Eg[l(9 myPl ,9 mtPl )] has been replaced by its asymptotic counterpart Re,oo(9 m , pi , 9 m , Pl )/(np 2 ). 
However, this inequality is not valid uniformly over any r) < 1 : when n converges to one, the 
constant L(K,p\,r\) tends to infinity. Indeed, if ||#||i converges to one, the lower bound (|3Tj) on the 
variance term can behave like (1 — \\9\\i) 3 d m / (np 2 ) for some matrices 9 whereas the penalty term 
d m /[np 2 (l — ||0||i)] tends to infinity. 

In the remaining part of the section, we illustrate that the constant L(K,r],pi) has to go to 
infinity when r\ goes to one. Let us consider the model mi. It consists of GMRFs with 4-nearest 
neighbors. 

^> 

Example 4.8. Let 9 be a non zero element o/0J^°, then the asymptotic risk of 0J^° pi simplifies 
as 



lim np Eg 

n — >+oo 



= 2- 



[1,0] 



cov(X [1,0], X [o,o]) 

If we let the size p of the network tend to infinity and 9 [1,0] go to 1/4, the risk is equivalent to 

16er 2 (l - 40[i,o]) 



(33) 



lim lim np 



9[i,o] -> 1/4 



log(16) 

The proof is postponed to the technical appendix Ver09b| . If follows from the second result that 
the lower bound (|30| is sharp since in this particular case tp m i„(I p 2 — C(9)) = a 2 (l — 40[i,o]). When 



0[i,o] tends to 1/4, then ||0||i tends to one and Ee[l(9% o uPl ,0)] behaves like <r 2 (l - PW^df^ / {np 2 ) 
whereas the penalty pen(mi) given in Theorem 13.11 has to be larger than a 2 d 1 ^ / '[np 2 (1 — ||0||i)]- 
Hence, the variance term and the penalty pen(.) are not necessarily of the same order when ||0||i 
tends to one. Theorem 13.11 cannot lead to an oracle inequality of the type l(32|) . which is valid 
uniformly on 77 < 1. 

Example 4.9. Let a be a positive number smaller than 1/4. For any integer p which is divisible 
by 4, we define the p x p matrix 9^ by 



'[-p/4,p/4] = 9 {pS> [p/4-p/A] 



[-p/i-p/i] 



9^[p/4.,p/4] = 

Then, the variance term is asymptotically lower bounded as follows 



a 

else . 



lim 



lim np Eg( P ) 

+00 n — >+oo 



I mi, pi ' L Jmi,pi 



> 



La 2 
1 - 4a 



The proof is postponed to the technical appendix Ver09b| . This variance term is of order 



a 2 df° / [np 2 (1 — ||0||i)] = tp max (T,)d^° I (np 2 ) when ||0||i goes to one. The penalty pen(m) introduced 
in Proposition 14.71 is therefore a sharp upper bound of the variance terms. 

On the one hand, we take a penalty pen(m) larger than a 2 d m /(np 2 (l — ||0||i)). On the other 
hand, the variance of 9 m . Pl is of the order cr 2 (l — \\9\\ i)d m / (np 2 ) in some cases. The bound (|32]l 
cannot therefore hold uniformly over any 77 < 1. We think that it is intrinsic to the penalization 
strategy. 
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5 Comments on the assumptions 

In this section, we discuss the dependency of the estimators 9 m>Pl on p\ as well as Assumptions 
(Hi) and (H 3 ). 

Dependency of 9 m , Pl on p\. We recall that the estimator 6 m , Pl is defined in fl8|) as the min- 
imizer of the CLS empirical contrast 7n,p(-) over 0+ . . It may seem restrictive to perform the 
minimization over the set 0+ pi instead of 0+ . Nevertheless, we advocate that it is not the case, 
at least for small models. Let us indeed define 

p(m) := sup ip max [l p 2 - C{6)] and p tso (m) := sup ip max [l p3 - C(9)] . 
eee+ 869+'"° 

The quantities p{m) and p lso (m) are finite since 0+ is bounded. If one takes p\ larger than pirn) 
(resp. p lso (m)), then the set 9^ pi (resp. 6^'™) is exactly 0+ (resp. 0+' lso ). We illustrate in 
Table [2] that p(m) and p lso (m) are small, when the model to is small. Consequently, choosing a 
moderate value for p\ is not really restrictive for small models. However, when the size of the model 
to increases, the sets 8^ ft and 0+ become different for moderate values of p\. In Section [7j we 
discuss the choice of p\ . 



dm 


2 


4 


6 


10 


p{m) 


2.0 


4.0 


5.0 


6.8 


J1SO 


1 


2 


3 


4 


p lso (m) 


2.0 


4.0 


5.0 


6.8 



Table 2: Approximate computation of p(m) and p lso (m) for the four smallest models with p = 50. 

Assumption (Hi) defined in i|22|) states that the largest eigenvalue of (I p 2 — C{9)) is smaller 
than p±. We have illustrated in Table [2] that if the support of 9 belongs to a small model to, then 
the maximal absolute value of [I p 2 — C(9)) is small. Hence, Assumption (Hi) is ensured for "mod- 
erate" values of p\ as soon as the support of 9 belongs to some small model. If 9 is not sparse but 
approximately sparse it is likely that the largest eigenvalue of 9 remain moderate. In practice, we 
do not know in advance if a given choice of p\ ensures (Hi). In Section [3 we discuss an extension 
of our procedure which does not require Assumption (Hi). 

Assumption (H2) defined in (f27]l states that 9 e Bi(0 p , 1) or equ ivalently that the matrix 
(I p 2 — C{9)) is diagonally dominant. Rue and Held prove in [RH05| Sect. 2. 7 that 0+j is in- 
cluded in £>i(O p ,l). They also point out that a small part of 0^ l2 does not belong to £>i(O p ,l). 
In fact, Assumption (H2) becomes more and more restrictive if the support of 9 becom es large r. 
Nevertheless, Assumption (H2) is also quite common in the literature (as for instance in [Guv95l p. 

If one looks closely at our proofs involving Assumptions (H 2 ), one realizes that this assumptions 
is only made to ensure the following facts: 

1. The projection 9 m>pi belongs to the open set ©^.m f° r an Y model m £ M. (Corollary 14. 3p . 

2. The smallest eigenvalue of (I p 2—C(9 m , pi )) is lower bounded by some positive number p 2 , .uniformly 
over all models to G M. 
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From empirical observations, these two last facts seem far more restrictive than (H2). We used 
Assumption (H2) in the statement of our results, because we did not find any weaker but still 
simple condition that ensures facts 1 and 2. 



6 Minimax rates 

In Theorem l3.1l and Proposition 14. 71 we have shown that under mild assumptions on 9 the estimator 
9 P1 behaves almost as well as the best estimator among the family {9 mtPl , m £ M}. We now 
compare the risk of 9 P1 with the risk of any other possible estimator 9. This includes comparison 
with maximum likelihood methods. There is no hope to make a pointwise comparison with an 
arbitrary estimator. Therefore, we classically consider the maximal risk over some suitable subsets 
T of + . The minimax risk over the set T is given by infg-sup egT Ee[l(9, 9)], where the infimum 
is taken over all possible estimators 9 of 9. Then, the estimator 9 Pl is said to be approximately 
minimax with respect to the set T if the ratio 



su Peer E e 


['( 






infg-sup egT E e 


l (0,0' 





is smaller than a constant that does not depend on a 2 , n or p. An estimator is said to be adaptive to 
a collection if it is simultaneously minimax over each %. The problem of designing adaptive 

estimation procedures is in general difficult. It has bee n extensively studied in the fixed design 
Gaussian regression framework. See for instance |BM0lj for a detailed discussion. In the sequel, 
we adapt some of their ideas to the GMRF framework. 

We prove in Section [BTTl that the estimator 9 Pl is adaptive to the unknown sparsity of the matrix 
9. Moreover, it is also adaptive if we consider the Frobenius distance between partial correlation 
matrices. In Section [621 we show that 9 Pl is also adaptive to the rates of decay of the bias. 

We need to restrain ourselves to set of matrices 9 such that the largest eigenvalue of the covari- 
ance matrix £ is uniformly bounded. This is why we define 

Vp 2 > 1 , U( P2 ) := jfl 6 9, <^ min (I P 2 - C{9)) > i- j . (34) 

Observe that 9 £ U{p2) is exactly equivalent to v'max(S) < a 1 pi since £ = a 2 (I p 2 — C{9)). 



6.1 Adapting to unknown sparsity 

In this subsection, we prove that under mild assumptions the penalized estimator 9 P1 is adaptive 
to the unknown sparsity of 9. We first lower bound the minimax rate of convergence on given 
hypercubes. 

Definition 6.1. Let m be a model in the collection Mi \ 0. We consider . . . , *&i dm ,j d ) a 

basis of the space <d m defined by For any 9' £ 0+ , the hypercube C m {6' ', r) is defined as 

C m {0',r):=U' + Y^ ih , jk <p k , </>e{0,l} d "j , 
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if the positive number r is small enough so that C m {6' , r) C 9 + . For any 9' <G 9+' 1S0 , we analogously 
define the hypercubes C' r s ° (9', r) using a basis {^f°j 1 , ■ ■ ■ , j d ) - 

Proposition 6.2. Let m be a model in M.i \ whose dimension d m is smaller than py/n. Then, 
for any estimator 9, 



SUp Eg 




> sup Eg 













> La 



2 "m 

np 2 



(35) 



Let 9' be an element of 9+ that satisfies (Hb). For any estimator 9 of 9, 



sup 



En 



> Lf#** [V - C(9')] ^ 



(36) 



where Co [C m (9',r)] denotes the convex hull of C m (0',r). 

An analogous result holds for isotropic hypercubes. The first bound (|35| means that for any 
estimator 9, the supremum of the risks Eg[l(9 m . Pl , 9)] over 9+ is larger than a 2 d m /(np 2 ) (up to 
some numerical constant). This rate o~ 2 d m / (np 2 ) is achieved by the CLS estimator by Theorem l3.ll 

The second lower bound (|36|) is of independent interest. It implies that in a small neighborhood 
of 9' the risk Eg{l(9 m , Pl , 9)] is larger than a 2 tp 2 nin {I p 2 — C(9')]d m /(np 2 ). This confirms the lower 
bound (j30j) of Corollary 14.61 in a nonasymptotic way. Indeed, these two expressions match up to a 
factor <p m i„[I p 2 — C(9')]. This difference comes from the fact that the lower bound l|36p holds for 
any estimator 9. Bound f36|) is sharp in the sense that the maximum likelihood estimator Q^ mle 
of isotropic GMRF in mi exhibits an asymptotic risk of order cr 2 ip 2 riin [I p 2 — C(9)]/(np 2 ) for the 
parameter 9 studied in Example 14.81 It is shown using the methodology introduced in the proof of 
Example 14.81 We now state that 9 p is adaptive to the sparsity of m. 

Corollary^ 6.3. Considering K > Kq, pi > 2, p2 > 2 and a collection M C M.\, we define the 
estimator 9 P1 with the penalty pen(m) = Ka 2 p\p-2^z ■ For any non empty model m, 



sup Eg I 

9ee+, P1 nW(>2) 



where U(p2) is defined in \3$ - 



< L (K, Pl ,p 2 ) inf sup E I [9,9 

o eee+, P1 nw( P2 ) 



(37) 



A similar result holds for 9 p s ° and 9+'^°. 



Corollary 16.31 is nonasymptotic and applies for any 
n and any p. If 9 belongs to some model m, then the optimal risk from a minimax point of view 
is of order ^jy. In practice, we do not know the true model to. Nevertheless, the procedure 

simultaneously achieves the minimax rates for all supports to possible. This means that 9 P1 reaches 
this minimax rate without knowing in advance the true model m. 

The procedure is not adaptive to the smallest and the largest eigenvalue of (L p 2 — C(9)) which 
correspond to p\ and p 2 . Indeed, the constant L{K,p\,p-x) depends on p\ and p 2 - We are not 
aware of any other covariance estimation procedure which is really adaptive the smallest and the 
largest eigenvalue of the matrix. 



Finally, 9 pi exhibits the same adaptive properties with respect to the Frobenius norm. 



INRIA 



Estimation of stationary Gaussian fields 



21 



Corollary 6.4. Under the same assumptions as Corollary W. 

\c(e pi )-c(6)\ 



Sup Eg 

0ee+, P1 nW(p 2 ) 



< L (K, p\ , pi ) inf sup E 

eee+, P1 nw(p 2 ) 



\C{ff)- C{6)\ 



Proof of Corollary \6.4\ As in the proof of Corollary 13.21 we observe that 



11(7(00 -C(0 2 )|| 



> 



P 2 Pi 



if 9 satisfies Assumption (Hi). We conclude by applying Proposition 16.21 and Corollary 



□ 



6.2 Adapting to the decay of the bias 

In this section, we prove that the estimator 9 Pl is adaptive to a range of sets that we call pseudo- 
ellipsoids. 

Definition 6.5 (Pseudo-ellipsoids). Let (aj)i<j<card(M!) be a non-increasing sequence of positive 
numbers. Then, 9 £ G + belongs to the pseudo-ellipsoid £{a) if and only if 



Card(M 



E 



1 var (X[o,o]\X^^ mi _^) - var e (X[o,o]\X^^) 



< 1 



(38) 



Condition (|38| measures how fast Vare(X[o,o]|X^/( m .)) tends to Vare(X[o,o]|XA\{(o,o)})- Suppose 
that Assumption (H 2 ) defined in <f27|l is fulfilled. By Corollary 14. 2\ Varg (X[o,o]\Xfj( m .^ is the sum 
of l{9 mi , 0) and a 1 and Condition f38|) is equivalent to 



Card(Xi) 

E 



l (6 



< 1 . 



(39) 



Hence, the sequence (<Zj) gives some condition on the rate of decay of the bias when the dimension 
of the model increases. These sets £ (a) are not true ellipsoids. Nevertheless, one may consider them 
as counterparts o f the cla ssical ellipsoids studied in the fixed design Gaussian regression framework 
(see for instance [Mas07l | Sect. 4. 3). 

To prove adaptivity, we shall need the equivalence between Conditions ([38)1 and ([39| . This 
equivalence holds if Varg (X[o,o]|X_^( m .)) decomposes as l(9 mi ,9) +cr 2 , for any model m £ M.\. As 
mentioned earlier, Assumption (H2) is sufficient (but not necessary) for this property to hold. This 
is why we restrict ourselves to study sets of the type £(a) n Bi(O p , 1). We shall also perform the 
following assumption on the ellipsoids £ (a) 

„2 



(H») : 



for any 1 < i < \Mi\ 



It essentially means that the sequence (aj converges fast enough towards 0. For instance, all the 
sequences Oj = <j{d mi )~ s with s > 1/2 satisfy (H a ). 

Proposition 6.6. Under Assumption (H Q ), the minimax rate of estimation on £(a) D Si(0 p , 1) n 
U{2) is lower bounded by 



inf sup Eg 

8 eG£(a)nBi(0 p ,l)nW(2) 



> L sup 

l<i<Card(Mi 



2.2 dmi 
d; hO 5 



(40) 
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This lower bound is analogous to the minimax rate of estimation for ellipsoids in the Gaussian 
sequence model. Gathering Theorem 13. H and Proposition 16.61 enables to derive adaptive properties 
for 



'pi ■ 



Proposition 6.7. Considering K > Kq, p\ > 2, p 2 > 2 and the collection Mi, we define the 
estimator 9 Pl with the penalty pen(m) = Ka 2 p\p 2 j^ . For any ellipsoid £{a) that satisfies (H a ) 

and such that a\ > l/(np 2 ), the estimator 9 P1 is minimax over the set £{a) fl 2?i(0 p , 1) C\U{p 2 ): 



sup Eg 

ee£(a)nBi(o P a)n«(p 2 ) 



7<i • 



< L(K, p 1 ,p 2 )hxf sup E fl 

8 se£(a)nei(o p ,i)nw(p 2 ) 



(41) 



Let us first illustrate this result. We have mentioned earlier, that Assumption (H a ) is satisfied 
for all sequences cii = a(d mi )~ s with s > 1/2. We note £'(s) such a pseudo-ellipsoid. By Propo- 
sitions H3] and [H3 the minimax rate over one pseudo ellipsoid £'(s) is cr 2 (np 2 )~ 2s ^ 1+2s \ The 
larger s is, the faster the minimax rates is. The estimator 9 P1 achieves simultaneously the rate 
a 2 (np 2 )~ 2s/ '( 1 + 2s ) for all s > 1/2. Consequently, 9 P1 is adaptive to the rate s of decay of the bias: 
it achieves the optimal rates without knowing s in advance. 

Let us further comment Proposition 16.71 By lj4lj) . the estimator 9 Pl is adaptive over £(a) n 
$i(O p ,l) C\U(p2) for all sequences (a) such that (H a ) is satisfied and such that a 2 > l/(np 2 ). 
Again, the result applies for any n and any p. The condition a\ > l/(np 2 ) is classical. It ensures 
that the pseudo-ellipsoid £ (a) is not degenerate, i.e. that the minimax rates of estimation is 
not smaller than a 2 /(np 2 ). We have explained earlier that we restricts ourselves to parameters 
9 in Si (0 P , 1) only because this enforces the equivalence between ij38j) and (J39j) . In contrast, the 
hypothesis (^ max (E) < a 2 p2 is really necessary because we fail to be adaptive to p2- 

Corollary 6.8. Under Assumption (H ), the minimax rate of estimation over £ (a)nt/(2)n£>i(0 p , 1) 
is lower bounded by 



inf sup ] 

e ee£(a)nB 1 (o p ,i)nu(2) 



\C{9)-C{9) 



> L sup 

l<i<Card(Mi) 



a 2 p 2 A 



Under the same assumptions as Proposition\6_ 

E 



sup 

0e£(a)nBi(o p ,i)n«(p 



< L(K, p 1 ,p 2 )ini sup E fl 

8 8e£(a)nB 1 (0 p S)nU(p 2 ) 



C{9)-C(9)\ 

Proof of Corollary IOI As in the proof of Corollary 13.21 we observe that 



\C(9)-C{9)\ 



||C(0i) - C(9 2 )\\ F > p 2 [ip max (Y,)]~ 1 l(9i 



> 



P 



\C(Bx) - C(9 2 )\\ F < p 2 [ip m UT,)]- 1 l(9 1 ,9 2 ) < p 



P2& 

cm 



2) , 



l{9^9 2 )< P -^-l(9^ 
o*- o~ 

if 9 S Ki(O p , 1) n B op (p 2 ). We conclude by applying Proposition 16.61 and Proposition 16.71 



□ 



Again, 9 P1 satisfies the same minimax properties with respect to the Frobenius norm. All these 
properties easily extend to isotropic fields if one defines the corresponding sets £ lso (a) ni3i(0 p , 1) n 
U(p 2 ) of isotropic GMRFs. 
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7 Discussion 

7.1 Comparison with maximum likelihood estimation 

Let us first compare the computational cost the CLS estimation method and the maximum likeli- 
hood estimator (MLE). For toroi dal lat tices, fast algorithms based on two-dimensional fast-Fourier 
transformation (see for instance [RT02l p allow to compute the MLE as fast as the CLS es timator 
More details on the computation of the CLS estimators for toroidal lattices are given in |Ver09al | 
Sect. 2. 3. When the lattice is not a torus, the MLE becomes intractable because it involves the 
optimization of a determinant of size p 2 . In contrast, the CLS criterion 7n,p(-) defined in (fT6|) is 
a quadratic function of 9. Consequently, CLS estima tors are still computationally amenable. We 
extend our model selection to non-toroidal lattices in |Ver09aj . 

Let us compare the risk of CLS estimators and MLE. Given a small dimensional model to, 
the risk of the parametri c CLS e stimator and the parametric MLE have been compared f rom an 
asymptotic point of view ( |Guv95l | Sect. 4. 3). It is generally accepted (see for instance Cressie [Cre93| 
Sect. 7.3.1) and that parametric CLS estimators are almost as efficient as parametric MLE for the 
major part of the parameter spaces 8+. We have non-asymptotically assessed this statement in 
Proposition 16.21 by minimax argument s. Nev ertheless, for some parameters 9 that are close to the 
border of 0+ , Kashyap and Chellappa KC84| have pointed out that CLS estimators are less efficient 



than MLE. If we have proved nonasymptotic bounds for CLS-based model selection method, we 
are not aware of any such result for model selection procedures based on MLE. 



7.2 Concluding remarks 

We have developed a model selection procedure for choosing the neighborhood of a GMRF. In 
Theorem 13.11 we have proven a nonasymptotic upper bound for the risk of the estimator 9 pi with 
respect to the prediction error £(., .). Under Assumption (Hi), this bound is shown to be optimal 
from an asymptotic point of view if the support of 9 belongs to one of the models injhe collection. 
If Assumption (Bfe) is fulfilled, we are able to obtain an oracle type inequality for 9 Pl . Moreover, 
9 P1 is minimax adaptive to the sparsity of 9 under (Hi). Finally, it simultaneously achieves the 
minimax rates of estimation over a large class of sets £(a) if (H2) holds. Some of these properties 
still hold if we use the Frobenius loss function. The case of isotropic Gaussian fields is handled 
similarly. 

However, in the oracle inequality (|32| and in the minimax bounds (J37j and (jUJ, we either 
perform an assumption on the l\ norm of 9 or on the smallest eigenvalue of (I p 2 — C(9)). When 
||0||i tends to one or <p m i„{I p 2 — C(9)\ tends to 0, there is a distortion between the upper bound 
Ee[Z(0 pi ,0)] provided by Theorem 13.11 and the lower bounds given by Corollary 14.61 or Proposition 
16.21 This limitation seems intrinsic to our penalization method which is linear with respect to the 
dimension, whereas the asymptotic variance term 'Eg[l(9 m _ Pll 9)] depends in a complex way on the 
dimension of the model m and on the target 9. In our opinion, achieving adaptivity with respect 
to the smallest eigenvalue of (I p z — C{9)) (or equivalently the largest value of S) would require a 
different penalization technique. Nevertheless, we are not aware of any procedure in a covariance 
estimation setting that is adaptive to the largest eigenvalues of S. 
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So far, we have provided an estimation procedure for (I p 2 — C(6)) = cr 2 E _1 . If we aim at 
estimating the precision matrix S^ 1 , we also have to take into account the quantit y a 2 . It is 
natural to estimate it by a 2 := l n ,p 2 {8pi) as done for instance by Guyon in [Guy95l ] Sect. 4. 3 in 
the parametric setting. Then, we obtain the estimate := (J 2 (I p 2 — C(8 Pl ). It is of interest to 
study the adaptive properties of this es timator with respect to loss functions such as the Frobenius 
or operator norm as done in RBLZQgl ] in the non-stationary setting. Nevertheless, let us mention 
that the matrix is not necessarily invertible since the estimator 9 Pl belongs to the closure of + . 

The choice of the quantity p\ is problematic. On the one hand, p\ should be large enough so 
that Assumption (Hi) is fulfilled. On the other hand, a large value of p\ yields worse bounds in 
Theorem l3.ll Moreover, the largest eigenvalue of (I p 2 —C(9)) is unknown in practice, which makes 
more difficult the choice of p\. We see two possible answers to this issue: 

• First, moderate values of pi are sufficient to enforce (Hi) if the target 9 is sparse as illustrated 
in Table [2 

• Second, we believe that the bounds for the risk are pessimistic with re spect to p\. A future 
direction of research is to derive risk bounds for 9 pi with pi = +oo. In |Ver09a| . we illustrate 
that such a procedure gives rather good results in practice. 



In Theorem 13. 1\ we only provide a lower bound of the penalty so that the procedure performs 
well. However, this bound depends on the largest eigenvalue of £ which is seldom known in practice 
and we did not give any advice for choosing a "reasonable" constant K in practice. This is why 
we intr oduce in [Ver09a| a data-driven method based on the slope heuristics of Birge and Massart 



BM07I | for calibrating the penalty. We also provide numerical evidence of its performances on 
simulated data. For instance, the procedure outperforms variogram-based methods for estimating 
Matern correlations. 

We have mentioned in the introduction that the toroidal assumption for the lattice is somewhat 
artificial in several applications. Nevertheless, we needed to neglect the edge effects in order to 
derive non asymptotic properties for 9 P1 as in Theorem 13.11 In practice, it is often more realistic 
to suppose that we observe a small window of a Gaussian field defined on the whole plane 1? . The 
previous nonasympto tic pro perties do not extend to this new setting. Nevertheless, Lakshman and 
Derin have shown in that there is no phase transition within the valid parameter space for 



GMRFs defined on the plane 1? . In short, this implies that the distribution of a field observed in 
a fixed window of a GMRF does not asymptotically depend on the bound condition. Therefore, 
it is reasonable to think that our estimation procedure performs well if it was adapted to this 
new setting. In |Ver09al |. we describe such an extension and we provide numerical evidence of its 
performances. 

7.3 Possible extensions 

In many statistical applications stationary Gaus sian fie lds (or Gaussian Markov random fields) 
are not directly observed. For instance, Aykroyd Avk9§l | or Dass and Nair DN03| use compound 



Gaussian Markov random fields to account for non stationarity and steep variations. The wavelet 
transform has emerged as a powerful tool in image analysis , the wave let coefficients of an image are 
sometimes modeled using hidden Markov models CNB98I . PSWS03| . More generally, the success 
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of the GMRF is mainly due to the use of hierarchical models involving latent GMRFs |RMC09l |. 
The study and the implementation of our penalization strategy for selecting the complexity of the 
latent Markov models is an interesting direction of research. 



8 Proofs 

8.1 A concentration inequality 

In this section, we prove a new concentration inequality for suprema of Gaussian chaos of order 2. 
It will be useful for proving Theorem 13. II 

Proposition 8.1. Let F be a compact set of symmetric matrices of size r, (Y 1 , . . . ,Y n ) be a 
n-sample of a standard Gaussian vector of size r, and Z be the random variable defined by 

Z := suptr [i?(FF-/,.)] . 

RGF 



Then 

¥(Z > E(Z) +t)< exp 
where the quantities B and W are such that 



LiE(W) 1 x L 2 B 



A 



(42) 



2 

B := — SUp <Pmax{R) 

n r(zf 



W := - sup tr(RYY*R') . 
n ReF 

The main argument of this proof is to transfer a deviatio n inequ ality for suprema of Rademacher 
chaos of order 2 to suprema of Gaussian Chaos. Talagrand [Tal96j has first given in Theorem 1.2 a 
concentration inequality for such suprema of Rademacher chaos. Boucheron et al. BBLM05| have 



recovered the upper bound applying a new methodology based on the entropy method. We adapt 
their proof to consi der non- necessarily homogeneous chaos of order 2. More details are found in the 
technical appendix [Ver09bl | . 

8.2 Proof of Theorem [331 



Proof of Theorem \3.1\ We only consider the case of anisotropic estimators. The proofs and lemma 
are analogous for isotropic estimators. We first fix a model m G M. By definition, the model fh 
satisfies 

7n,p(0 Pl ) + pen(m) < j„ t p(9 m , p J + pen(m) . 

For any 9' G 9 + , 7„ iP (#') stands for the difference between ~f n ,p(0') and its expectation j{0'). Then, 
the previous inequality turns into 

7(^i) < 70WJ + 7 7r hP (0 m , P i) -7n,p(0*>i) +pen(m) - pen(m) . 
Subtracting the quantity "f{9) to both sides of this inequality yields 

) - ln, P (°Pi) + pen(m) - pen(m) . (43) 
The proof is based on the control of the random variable 7„ p (#m,pi) — 7n, P (^pi)- 
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Lemma 8.2. For any positive number a, and S > 1 the event Sl^ defined by 

ln,p (0m,* ) - 7n,p (^) < ^ , 0) + ^ (0 m , Pl , 0) 

(1 + a/2) (d m + d^) + 



ft e = 



satisfies 

P(f^) < cxp i -L x £ 



A Wn 



^ exp < -Li\fd n 



a of 

A 



m'&M. 



y/l + a/2 l + a/2 



A similar lemma holds in the isotropic case. In particular, we choose a = (K — K )/K and 
6 = 1/(1 + a)/(l + a/2). Lemma HT2l implies that on the event fif, 



7n,p(^m,pi) 7n,p ( ^Pl 



< 



+ pen(m) + 



X e^(a) 2 p?Vmax(S) 



l(O m , Pl ,Q) +pen(m) 



np 2 (5(a) — 1) 
Thus, gathering this bound with inequality (|43|) yields 

^'tnr^Wpi' ) ^ \l + 6(a)- 1 / 2 (6(a) 1 / 2 -l)- 1 ]l(6 m , pi ,6)+2ven(m) 



5(a)V2 



ffo£ 3 P?¥W(E)a(a)' 



np 2 (S(a) — 1) 

with probability larger than 1 — P(^). Integrating this inequality with respect to £ > leads to 



5(a) 1 / 2 - 1 
5(a) 1 / 2 



En 



I [0 Pl ,9 



< 



1 + <S(a)- 1 / 2 ((5(a) 1 / 2 -l)" 



Zl 



2pen(m) 



5(a) 2 L(a) p 2 ^ max (S) 



(5(a) - 1) 



1+572 



A n 



(44) 



We upper bound [(a 2 /(l + a/2)) A n}- 1 by.[(a 2 /(l + a/2)) A l]" 1 . Since a = it follows that 



3, 



I (0 P1 ,0)] < Li(K) [l(6 m , Pl ,e)+pen(m)]+L 2 (K) 



np* 



Taking the infimum over the models m € M. allows to conclude. 



□ 



Proof of Lemma Unil Throughout this proof, it is more convenient to express the quantities 7 np (.) 
and Z(.) in terms of covariance and precision matrices. Thanks to Equation lfT9|) . we also provide a 
matricial expression for 7(.) : 

7 (0') = Itr [(I C{9')) £ (I - C(0'))] . (45) 
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Gathering identities f45|) and ifTTj) . we get 



T ra ,p(^™,pi) -7„, P (^i) = -2*?" 



Ip2 — C(6* miPl )] 



X v X v* _ E 



Since the matrices E, (I p 2 — C(9 m . Pl )), and (/ p 2 —C(8 Pl )) correspond to covariance or precision 
matrices of stationary fields on the two dimensional torus, they are symmetric block circulant. By 
Lemma TA.H they are jointly diagonalizable in the same orthogonal basis. In the sequel, P stands 
for an orthogonal matrix associated to this basis. Then, the matrices C{9 m>Pl ), C(6 Pl ), and E 
respectively decompose in 



C(0„ 



P*D(9 m , Pl )P, C(d Pl ) = P*D(9 Pl )P, E - P*D^P, 



where the matrices D{Q m , Pl ), D(6 Pl ), and are diagonal. Let the p 2 x n matrix Y be defined by 
Y := \/E _1 X v . Clearly, the components of Y follow independent standard normal distributions. 
Gathering these new notations, we get 

7n,p(^TO I( oi) — 7n,p(@pi) = 

ijtr ([If - D(6 m . pi )] 2 - 



D(e pi ) 



D r . YY* - I„ 



(46) 



Except YY* all the matrices in this last expression are diagonal and we may therefore commute 
them in the trace. 



Let < ., . >h and < ., . >w be two inner products in the space of square matrices of size p 2 
respectively defined by 

tr(A*Y,B) A tr(A*DxB) 

< A,B> n := —^— 9 '- and < A,B > w := — y - . 

p z p z 

This first inner product is related to the loss function .) through the identity 



l(B',0) = \\C(0') - C(9)\\ 



n 



Besides, these two inner products clearly satisfy ||C(^')IIh = f° r anv @' *= ® + - Gathering 

these new notations, we may upper bound (|46|) by 



7n, P (0™,pJ 7 n , P (0 Pl ) < \\[If - D(9 m , Pl )} 2 - [If - D(9 P1 )} 2 \\ H , X 



sup 

II I' 2 - D(«l)] 2 - [I 2 - D(8 2 )] 2 || H , < 1 



[If - D{0x)}- [V - D (^)} . [ YY * - V])g7) 
The first term in this product is easily bounded as these matrices are diagonal. 



[If - D(6 m , pi )} 2 - [If - D(6 P1 ] 



\w 



tr 



[If~D(e m , Pl )] 2 -[If-D(8 Pl )f 



tr 



D(0 m , Pl ) - D(6 P1 )\ 2 ^ [21 f - D(6 m , Pl ) - D(9 P1 ) 



1/2 



2If-D(9 mtP1 )-D(e pi ) \\D(9 m:P1 ) - D(6 P1 )\\ W . (48) 
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Since m ,pi and 9 Pl respectively belong to 6^ pi and 0i , the largest eigenvalues of the matrices 
ip2 — C(6* m pi ) and / p 2 — C{9 Pl ) are smaller than pi. Hence, we get 



2I p 2 - D{ 



,pi) - ) = "Aiiax [V - C(^m,pJ] + fmax Ip 2 - C(0 pi ) 



< 2pi 



Let us turn to the second term in (|47|) . First, we embed the set of matrices over which the supremum 
is taken in a ball of a vector space. For any model m' £ M, let U m i be the space generated by the 
matrices D{9') 2 and D{9') for 9' £ & m '- In the sequel, we note d m i2 the dimension of U m '. The 
space U m , m > is defined as the sum of U m and U m > whereas d m 2 m ,2 stands for its dimension. Finally, 
we note B^ 2 m ,2 the unit ball of U m , m ' with respect to the inner product < | >u' ■ Gathering these 
notations, we get 



sup (R,YY* - I p 2) n , < sup — tr [RD^ (YY* - I p2 )] 

R=[l-D(e 1 )] 2 -[I P 2-D(e 2 )} 2 , 
8i e e m ,6 2 £ 9™ and \\R\\ H > < 1 



flee"; = a p 



Applying the classical inequality ab < Sa 2 + 5 1 b 2 /4 and gathering inequalities (|47| and (|48| yields 



< 



5- 1 \\C{0 m ^)-C{0 Pl )\\ 2 n + p\8 sup -tr 2 [RDv (YY* - I p 2)] 

R ^ B S «,» P 



(49) 



For any model m' € X, we define the random variable Z m > as 



sup — tr [RDx {YY* -7 p2 )] 



The variables Z m ' turn out to be suprema of Gaussian chaos of order 2. In order to bound Z,f,,, we 
simultaneously control the deviations of Z m ' for any model mf £ M thanks to the following lemma. 

Lemma 8.3. For any positive numbers a and £ and any model ml £ M. 

2^ maa; (E) 



Zm / ^ 



, / m 

L2 V d m ' 



< 



exp 



a/1 + a/2 1 + a/2 



y/l + a/2 



A Vn 



This result is a consequence from a general concentration inequality for suprema G aussian chaos 
of order 2 stated in Proposition 18. II Its proof is postponed to the technical appendix [VerOQbj. Let 
us fix the positive numbers a and £. Applying Lemma HO to any model ml £ M, the event fi^ 
defined by 



fit — < Zfa < 



2Vmax(S) 
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satisfies 

P(^ c ) < cxp \ -L x i 



y/1 + a/2 
From inequality l(49|) . it follows that 



A Wn 



^ exp < -L 2 \fd~n 



a or 

A 



m'EM 



Vl + a/2 1 + a/2 



7n,p(fT B ,pJ-7n lP (^) < ^ 1 llc(e m , P1 )-C7(e pl )|^ + 2 ^^ (s) 



conditionally to fi^. By triangle inequality, 

- c(e pi )\\ n < \\C{9 m . Pl ) - C(9)\\ n + \\C(9 Pl ) - C(8)\\ H . 

We recall that the loss function I (9', 9) equals \\C{9') — C(0)||^. We apply twice the inequality 
(a + b) 2 < (1 + (3)a 2 + (1 + /3 _1 )6 2 . Setting the first /3 to y/5 - 1, it follows that 



V5 



7n,p(^m,pi ) — Jn,p(®Pi ) < T/f Pl ' ^ " T 



2<5/3iVmax(S) 



[cU^Cl + /?)(! + «/2) + e 2 (i + 



By definition of U m ,fn^ its dimension d m 2 is bounded by d m 2 + d^i. Choosing j3 = 6 — 1 yields 

1 . ~ \/£ 



7n,p(^m,/t»i) - 7n,p(^i) ^ -^=/(6> pi , 6>) + ^ _ 1 *("m,Pi ' ' 



(50) 



H 2 i"™ 2 + a / 2 ) + "™ 2 + a / 2 )J + 277 — 7T~ ■ 

np z np (d — 1) 

To conclude, we need to compare the dimension d m >2 of the space U m > with d m < ■ 
Lemma 8.4. For any model m e M., it holds that 

^m 2 — Ld m , 

where L is a numerical constant between 4 and 5.48. 

The proof is postponed to the technical appendix Ver09b| . Defining the universal constant 
Ko ■= 2L, we derive from {BUI that 

~ 1 ~ \/8 



K 8 2 p\ip max (Y) 



d m (l + a/2) + d^l + a/2) + 



5-1 



with probability larger than P(f2t). The isotropic case is analogous if we replace d m by d™. □ 
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8.3 Proofs of the minimax results 

Let us first prove a minimax lower bound on hypercubes C m (0' , r). We recall that these hypercubes 
are introduced in Definition 16.11 



Lemma 8.5. Let m be a model in M.\ that satisfies d m < ^/np and let 6' be a matrix in m fl 
2?i(0 p , 1). Then, for any positive number r such that (1 — — 2rd m ) is positive, 



inf sup Eg 

eeGo[C m (9' ,r)] 



1(6,6 



> La r A 



1 - 



np* 



where Co[C m (9',r)] denotes the convex hull of C m (6' ,r) . Similarly, let m be a model in M.\ such 
dfn — y/nP an d let 6' be a matrix in O 1 ^ n Bi(0 p , 1). Then, for any positive number r such that 
(1 - || 0' ||i - 8rd™>) is positive, 



inf sup Kg 

e eeCo[c^°(6»',r)] 



16 



> La 2 r A 



'np* 



Proof of Proposition \6.2\. The first result derives from Lemma l875l applied to the hypercube C m (0 p , (np 2 y 
We prove the second result using the same lemma with C m [6' , (1 — H^Hi)/ (v^ip)]- D 



Proof of Lemma \8.5\ This lower bound is based on an application of Fano's approach. See Yu97l | 
for a review of this method and comparisons with Le Cam's and Assouad's Lemma. The proof 
follows three main steps: First, we upper bound the Kullback-Leibler entropy between distributions 
corresponding to 6\ and 62 in the hypercube. Second, we find a set of points in the hypercube well 
separated with respect to the Hamming distance. Finally, we conclude by apply ing Birge's version 
of Fano's lemma. More details can be found in the technical appendix Ver09bl | . 



□ 



Proof of Proposition ^. 61 First, observe that the set £(a)f]Bi(0 p , 1/2) is included in £(a)n£>i(0 p , l)n 
U{2). We then derive minimax lower bounds on £(a) fl i3i(0 p ,l/2) from the lower bounds on 
hypercubes. 

Let rfii be a model in M. x such that d m is smaller than y/np. Let us look for positive numbers 
r such that the hypercube [C mi (Q p ,r)] is included in the set £(a) n £>i(0 p , 1/2). 

Lemma 8.6. Let m be a model in M.\ and r be a positive number smaller than l/(4d m ). For any 
6 Co[C m (0 P) r)], 

var e (X[o,o}) < a 2 (l + 16d m r 2 ) . 
The proof is postponed to the technical appendix |Ver09bl |. If we choose 



lQoJd^ 
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then 2rd mi is smaller than 1/8 by assumption (H Q ). Applying Lemma 

Varg (X[o,o]) < a 2 + af. Hence, we get the upper bound 

J2)=i [Var (Jf [o,o]|X m ^_ 1 ) - Var (X[o,o]|X mj )] < a 2 and it follows that 



we then derive that 



0a ^ <l) Var(X[o,o]|X n 

3=1 



Var (X[o,o]|X m ^ 



< 1 



since the sequence (aj)i<j<card(Xi) ls 11011 increasing. Consequently, Co[C m (O p ,r)] is a subset of 
£(a) n Si (0 P , 1/2). By Lemma HT5l we get 



inf sup 

e eef(o)nBi(o p ,i/2) 



> La; A 



16cr 2 

<T 2 d, 



np z 



Considering all models m G Mi such that d m < \fnp yields 



inf sup Eg 

8 0e£(a)nB 1 (O p S/2) 



> L sup 

i<Card(JHi), d mi <^/np 



af A 



cr 2 d„ 
np 2 



(51) 



(52) 



If the maximal dimension d mcard{Mi) is smaller than \/np, the proof is finished. In the opposite 
case, we need to show that the supremum lj40)l over all models me M.\ is achieved at some model 
m of dimension less than -^/np. 

Lemma 8.7. For any integer 1 < i < Card(M.\) — 1, the ratio d mi+1 /d mi is less than 2. 

The proof of Lemma 18771 is postponed to the technical appendix |Ver09b| . Let i' be the largest 
integer such that d m ., < yfnp. Since i' is smaller than Card(Ali), we know from Lemma \8. 71 that 
^/np/2 < dm., < y/np. By assumption (H a ), a 2 is smaller than a 2 /dm.,. Gathering these bounds 
yields 



af, < 



< 



4d m .,a 2 



np^ 



Since the sequence (<2i)i<j<card(Ali) is non increasing, the supremum (|40f over all models in A4\ 
is either achieved for some i < i' or is smaller than 4(a 2 A a 2 d m ., /(np 2 )). □ 



Proof of CorollarvKM Observe that Co[C m (0 p , l/(4d m )] is included in 6 m nSi(0 p , 1/2). This last 
set is itself included in 9+ pi nW^). Applying Lemma [HHJ we get the following minimax lower 
bound 



inf 



sup 

eee+, P1 nw( /92 ) 



E 



lie, e 



> La 2 - 



np^ 
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since the dimension d m is smaller than np 2 . Applying Theorem 13. H we derive that 



sup 

dee+, P1 nu(p 2 ) 



< L(K)a 2 p(p 2 



< L(K,p 1 ,p 2 )a 2 -^. 



np* 
np* 



L 2 {K) 



np 2 



sup 

0ee+, P1 nW(p 2 ) 



We conclude by combining the two different bounds. 



□ 



Proof of Proposition \6.7[ This result derives from the upper bound of the risk of 9 P1 stated in 
Theorem 13.11 and t he minim ax lower bound stated in Proposition 16.61 For details, we refer to the 
technical appendix [Ver09bl | . 

□ 



8.4 Proofs of the asymptotic risk bounds 

Proof of Proposition \4-4\ This result is closely related to Proposition 4.11 in [Guv95l |. In fact, we 
extend his proof to stationary fields on a torus. In the sequel, we shall only consider non-isotropic 
GMRFs, the isotropic case being similar. Let us fix a model m in the collection M.\ and let us 
assume (Hi). 

We define the d m x p 2 matrix x m as 

(x v m T :=([C(*w>m, k = l,...,d m ) ■ 

For any (i, j) £ {1, . . . , p} 2 , the (i— l)p + j-th row of x m corresponds to the list of covariates used 
when performing the regression of X[i,j] with respect to its neighbours in the model m. Contrary 
to the previous proofs, we need to express the n x p 2 matrix X v in terms of a vector. This is why 
we define the vector X v of size np 2 as 

X v [p 2 0-i)+ P (i 1 -i)+i 2 ] := X j [i u i 2 ] , 

for any € {1, . . . ,p} 2 and any j < n. Similarly, let Xm be the d m x np 2 matrix defined as 

Xm[k,P 2 U-^)+P(ii- 1 )+i2] ■= X J m b('i-l)+'2] , 

for any (ii,^) € {1> ■ ■ • ,p} 2 and any j < n. 

We are not able to work out directly the asymptotic risk of 9 m ,pi ■ This is why we introduce 
a new estimator 9 m whose asymptotic distribution is easier to derive. Afterwards, we shall prove 
that 

@m &nd $m,pi have the same asymptotic distribution. Let us respectively define the estimators 
a rn in R dm and # m as 

dm := ((xlY Xl)' 1 X^X V (53) 

d m 
fc=l 

where we recall that ( 1 $i ly j 1 , . . . ^i dm ,j dm ) is a basis of 8 m . Obviously, 6 m is a Conditional least 
squares estimator since it minimizes the expression (fl6|l of 7n, P (.) over the whole space 9 m . Con- 
sequently, m coincides with 9 m . pi if m belongs to 6+ pi . 
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For the second result, we assume that Assumption (Hfe) holds. Applying Corollary 14. 2\ we know 
that for any (k, I) <G A, X[k,l] decomposes as 

X[k,l] = ^2 Qm, Px [i,3]X[k+i,l+i] + e m \k,l] , (54) 

where e m [k,i] is independent from {X[k+i,i+j], G m}. For the first result, the same decompo- 

sition holds since 9 is assumed to belong to 0+ and 9 m , Pl therefore equals 9. 

Let a m £ M. dm be the unique vector such that 6 m>Pl = T^kZ\ a m[kY^i k .j k . Then, the previous 
decomposition becomes 

Gathering this last identity with (|53| yields 

1 / V\t v\ ( \ V V 
o \X.m) Xm I I o ^m^m 

np s J \ np z 

where the vector ef n of size np 2 corresponds to the n observations of the vector . When n goes to 
the infinity, l/(np 2 )(xl)*xl converges almost surely to the covariance matrix V by the law of large 
numbers. By definition, the variable e m [i,j] is independent from the (i— l)p + jth row of Xmly']. It 
follows that Ee(x^e v ) = 0. Applying again the law of large numbers we conclude that a m converges 
almost surely towards a m and that 9 m converges almost surely towards #m, Pl . Besides, the central 
limit theorem states that the random vector l/(^/np)Xm eV converges in distribution towards a zero 
mean Gaussian vector whose covariance matrix equals l/p 2 Varg (Xm e m)- By decomposition (|54p . 
e v m = (I-C(9 m ^ 1 ))X v while the k-th row of \ v m equals [C(^ lkJk )X v \* . Thus, for any 1 < k, I < d m , 

^Var e ( x ^)[ M = -^cov e [(X v )*C($ iktjk ) [I - C(9 m . Pl )} X v , (X v )*C(^ tl:]l ) [I - C(0m, Pl )] X v ] . 

As the covariance matrix of X v is a 2 (I — C(9)) 1 , we obtain by standard Gaussian properties 

^Var e (xL e m) [Ml = 



2<7 4 

—5- cov e 



[7 - C^)]" 1 C(* <fcA ) [I - C(9 m , Pl )] [I - C{e)]~ l C(V itJl ) [I - C{0 m>Pl )] 



By Lemma lA~Tl all these matrices are diagonalizable in the same basis and therefore commute with 
each other. We conclude that p-Varg (Xm e m) = 2<r 4 W / and 



As 8m, pi belongs to 0+ pi , there exists a unique vector a m € M. dm such that 9 mpi — Ylk=i a m[ lt ]^i k ,jk ■ 
The matrix 9 m _ pi belongs to the open set 8^ ft for the two cases of the propositions. Indeed, m , pi 
equals 9 in the first situation. In the second situation, this is due to the fact that 9 satisfies (H2) 
and to Lemma |4~T1 

Since 9 m converges almost surely to m , P i, the matrix 9 m belongs to m with probability going 
to one when n goes to infinity. If follows that the estimators a m and a m coincide with probability 
going to one. By Slutsky's Lemma, we obtain that 
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Let us express the risk of 9 m>Pl with respect to the distribution of a r . 



l(d m , Pl ,O miPl ^ =Ee ^2(a m [k] - a m [k])tr (^i^^X) 



-k=l 



= tr [V (a m —a m )* (a m - a m )] 



By Portmanteau's Lemma, np 2 l(8 mtP1 , m ,pi) converges in distribution towards a random variable 
whose expectation is tr (WV' 1 ) . In order to conclude, it remains to prove that the sequence 
[np 2 l{9 

m,pn @)]n>i is asymptotically uniformly integrable. 
Let us consider a model selection procedure with the collection M = {m} and a penalty term 
satisfying the assumptions of Theorem [331 Arguing as in the proof of this theorem, we derive from 
identity lf44"l) the following property. For any £ > 0, with probability larger than \ —L\ exp [— L 2 £], 



(£) . 

This clearly implies that the sequence [np l(9 m>Pl , 6* miPl )]„>i is asymptotically uniformly integrable 
and the first part of the result follows. 



For the first result of the proposition, we have stated that 9 equals m . As a consequence, 



lim Eg 

n — >+oo 



= 2aHr [W 1 ] . 



Besides, the term W[fc,i] here equals tr [C( 1 i'i k .j k )C(^i l .j l )]. This last quantity is zero if k ^ I and 



equals \\C(^ k )\\ 2 F if k = I. 



□ 



Proof of Proposition \4- 7\ As 9 belongs to + n B\ (0 P , rj), the largest eigenvalue of S is smaller than 
a /(l — rj). Applying Theorem 13. 1[ we get 



En 



< L(K) inf 



l{8 m , Pl ,9) + K — j- - 



< L(K, V ) inf 



np^ 



Gathering this bound with the result of Corollary 14.61 enable us to conclude. 



□ 



Lemma A.l. There exists an orthogonal matrix P which simultaneously diagonalizes every p 2 xp 2 
symmetric block circulant matrices with p x p blocks. Conversely, if 9 is a square matrix of size p 
which satisfies then the matrix D(&) = PC(9)P* is diagonal and satisfies 

p v 

D(9)[(t-i)p+j,(t-i)p+j] = 6[k ' l] cos ( 2n ( ki /P + 1 J/P)) ( 55 ) 

fe=i i=i 

for any 1 < i,j < p. 
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It is proved as in Sect. 2. 6. 2 to the price of a slight modification in order to take into 

account the fact that P has is orthogonal and not unitary. The difference comes from the fact that 
contrary to Rue and Held we also assume that C{6) is symmetric. 

This lemma states that all symmetric block circulant matrices are simultaneously diagonalizable. 
Moreover, Expression ([55]) explicitly provides the eigenvalues of the C(6) as the two-dimensional 
discrete Fourier transform of the p x p matrix 9. 
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